We are collecting a dataset on water quality to train a machine learning model for binary classification: determining whether water is safe for consumption (1) or not (0). This model will help with water treatment decisions and ensure compliance with quality standards. We applied different summarization and plotting methods to help us understand our dataset, such as scatter plots, histograms, and bar plots. Then we preprocessed our data using data cleaning, data transformation, and feature selection.
1-Classification in this dataset aims to build a predictive model that can classify water samples into two categories: potable (suitable for consumption) or non-potable (not suitable for consumption).
2-Clustering in this dataset aims to identify natural groupings or clusters within the water samples based on their quality parameters.
Kaggle
https://www.kaggle.com/datasets/uom190346a/water-quality-and-potability
#install.packages("caret")
#install.packages("glmnet")
#install.packages("Boruta")
#install.packages("mlbench")
#install.packages("randomForest")
library(cluster)
library(factoextra)
library(outliers)
library(dplyr)
library(mlbench)
library(caret)
library(glmnet)
library(Boruta)
library(ggplot2)
library(randomForest)
library(pROC)
library(e1071)
library(caret)
library(party)
library(partykit)
library(RWeka)
library(C50)
library(printr)
library(rpart)
library(rpart.plot)
getwd()
[1] "/Users/mahayie/Documents/GitHub/DM1Project"
#setwd("/Users/mahayie/Desktop/326p")
#getwd()
water_potability = read.csv('Dataset/water_potability.csv')
str(water_potability)
'data.frame': 3276 obs. of 10 variables:
$ ph : num NA 3.72 8.1 8.32 9.09 ...
$ Hardness : num 205 129 224 214 181 ...
$ Solids : num 20791 18630 19910 22018 17979 ...
$ Chloramines : num 7.3 6.64 9.28 8.06 6.55 ...
$ Sulfate : num 369 NA NA 357 310 ...
$ Conductivity : num 564 593 419 363 398 ...
$ Organic_carbon : num 10.4 15.2 16.9 18.4 11.6 ...
$ Trihalomethanes: num 87 56.3 66.4 100.3 32 ...
$ Turbidity : num 2.96 4.5 3.06 4.63 4.08 ...
$ Potability : int 0 0 0 0 0 0 0 0 0 0 ...
Sample of the raw dataset (first 10 rows):
head(water_potability,10)
Sample of the raw dataset (last 10 rows):
tail(water_potability, 10)
Five-number summary of each attribute in our dataset:
summary(water_potability)
ph Hardness Solids Chloramines Sulfate Conductivity
Min. : 0.000 Min. : 47.43 Min. : 320.9 Min. : 0.352 Min. :129.0 Min. :181.5
1st Qu.: 6.093 1st Qu.:176.85 1st Qu.:15666.7 1st Qu.: 6.127 1st Qu.:307.7 1st Qu.:365.7
Median : 7.037 Median :196.97 Median :20927.8 Median : 7.130 Median :333.1 Median :421.9
Mean : 7.081 Mean :196.37 Mean :22014.1 Mean : 7.122 Mean :333.8 Mean :426.2
3rd Qu.: 8.062 3rd Qu.:216.67 3rd Qu.:27332.8 3rd Qu.: 8.115 3rd Qu.:360.0 3rd Qu.:481.8
Max. :14.000 Max. :323.12 Max. :61227.2 Max. :13.127 Max. :481.0 Max. :753.3
NA's :491 NA's :781
Organic_carbon Trihalomethanes Turbidity Potability
Min. : 2.20 Min. : 0.738 Min. :1.450 Min. :0.0000
1st Qu.:12.07 1st Qu.: 55.845 1st Qu.:3.440 1st Qu.:0.0000
Median :14.22 Median : 66.622 Median :3.955 Median :0.0000
Mean :14.28 Mean : 66.396 Mean :3.967 Mean :0.3901
3rd Qu.:16.56 3rd Qu.: 77.337 3rd Qu.:4.500 3rd Qu.:1.0000
Max. :28.30 Max. :124.000 Max. :6.739 Max. :1.0000
NA's :162
dim(water_potability)
[1] 3276 10
This is a sample of the dataset to help understand how it is structured and organized.
View(water_potability)
water_potability[sample(nrow(water_potability), 10), ]
Missing or null values refer to the absence of data in certain variables or columns of a dataset, which can occur for various reasons. They can negatively affect the dataset's usefulness and the information that can be extracted from it later, so we checked whether our data had missing or null values and eliminated those rows to produce a cleaner dataset.
First, we checked for missing values; this ensures accurate statistics and reliable visualizations, and guides decisions on imputation or removal of missing data.
dim(water_potability)
[1] 3276 10
sum(is.na(water_potability))
[1] 1434
colSums(is.na(water_potability))
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
491 0 0 0 781 0 0
Trihalomethanes Turbidity Potability
162 0 0
water_potability = na.omit(water_potability)
colSums(is.na(water_potability))
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon
0 0 0 0 0 0 0
Trihalomethanes Turbidity Potability
0 0 0
View(water_potability)
The standard deviation in statistics is a measure used to assess the spread of data around the mean. It gives us an idea of how much the data points deviate from the average.
sd(water_potability$Turbidity)
[1] 0.7803462
sd(water_potability$Solids)
[1] 8642.24
sd(water_potability$Conductivity)
[1] 80.71257
sd(water_potability$Organic_carbon)
[1] 3.324959
sd(water_potability$ph)
[1] 1.573337
We computed it for five columns: Turbidity, Solids, Conductivity, Organic_carbon, and ph.
The mean, or average, is a measure of central tendency in statistics. It is calculated by summing all the values in a dataset and dividing by the number of values. The mean gives us a representative value that is typically used to describe the "typical" value in a set of data.
mean(water_potability$Turbidity)
[1] 3.969729
mean(water_potability$Solids)
[1] 21917.44
mean(water_potability$Conductivity)
[1] 426.5264
mean(water_potability$Organic_carbon)
[1] 14.35771
mean(water_potability$ph)
[1] 7.08599
We computed it for five columns: Turbidity, Solids, Conductivity, Organic_carbon, and ph.
The median represents the middle value in a dataset when the values are arranged in ascending or descending order.
median(water_potability$Turbidity)
[1] 3.968177
median(water_potability$Solids)
[1] 20933.51
median(water_potability$Conductivity)
[1] 423.4559
median(water_potability$Organic_carbon)
[1] 14.32202
median(water_potability$ph)
[1] 7.027297
We computed it for five columns: Turbidity, Solids, Conductivity, Organic_carbon, and ph.
The variance provides information about how far each value in the dataset is from the mean. A higher variance indicates a greater spread of data, while a lower variance suggests that the data points are closer to the mean.
var(water_potability$Turbidity)
[1] 0.6089401
var(water_potability$Solids)
[1] 74688309
var(water_potability$Conductivity)
[1] 6514.519
var(water_potability$Organic_carbon)
[1] 11.05535
var(water_potability$ph)
[1] 2.475388
As you can see, Solids has the highest variance and Turbidity the lowest.
Using the minimum, maximum, mean, and median together helps provide an overview of the data's key characteristics.
summary(water_potability$Conductivity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
201.6 366.7 423.5 426.5 482.4 753.3
summary(water_potability$Organic_carbon)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.20 12.12 14.32 14.36 16.68 27.01
summary(water_potability$Hardness)
Min. 1st Qu. Median Mean 3rd Qu. Max.
73.49 176.74 197.19 195.97 216.44 317.34
summary(water_potability$Solids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
320.9 15615.7 20933.5 21917.4 27182.6 56488.7
summary(water_potability$Chloramines)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.391 6.139 7.144 7.134 8.110 13.127
summary(water_potability$Potability)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4033 1.0000 1.0000
summary(water_potability$Sulfate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
129.0 307.6 332.2 333.2 359.3 481.0
summary(water_potability$Trihalomethanes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.577 55.953 66.542 66.401 77.292 124.000
summary(water_potability$Turbidity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 3.443 3.968 3.970 4.514 6.495
summary(water_potability$ph)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2275 6.0897 7.0273 7.0860 8.0530 14.0000
This step involved transforming the class label, Potability, into categorical data. We changed the numeric values to 'Not Potable' and 'Potable' to indicate whether the water is safe for human consumption, where 1 represents 'Potable' and 0 represents 'Not Potable'.
water_potability$Potability[water_potability$Potability == '0'] <- 'Not Potable'
water_potability$Potability[water_potability$Potability == '1'] <- 'Potable'
water_potability$Potability <- as.factor(water_potability$Potability)
table(water_potability$Potability)
Not Potable Potable
1200 811
print(water_potability)
Outliers are observations that lie far away from the majority of the data. They can occur for various reasons such as measurement errors, experimental anomalies, or genuine extreme values.
## Before removing outliers:
dim(water_potability)
[1] 2011 10
head(water_potability)
Removing outliers from a dataset is critical for ensuring the quality and reliability of statistical analysis and machine learning models. We identified all outliers in the numerical attributes and subsequently eliminated the rows containing them.
summary(water_potability$ph)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2275 6.0897 7.0273 7.0860 8.0530 14.0000
quartiles <- quantile(water_potability$ph, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
6.089723 8.052969
iqr <- IQR(water_potability$ph)
iqr
[1] 1.963245
lower <- quartiles[1] - 1.5*iqr
lower
25%
3.144855
upper <- quartiles[2] + 1.5*iqr
upper
75%
10.99784
boxplot(ph ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$ph, ylab = 'ph')$out
out_val
out_rows <- which(water_potability$ph %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$ph)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.231 6.105 7.027 7.087 8.030 10.905
#-------------------------------------------
summary(water_potability$Hardness)
Min. 1st Qu. Median Mean 3rd Qu. Max.
73.49 176.90 197.36 196.27 216.44 317.34
quartiles <- quantile(water_potability$Hardness, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
176.9031 216.4411
iqr <- IQR(water_potability$Hardness)
iqr
[1] 39.53799
lower <- quartiles[1] - 1.5*iqr
lower
25%
117.5961
upper <- quartiles[2] + 1.5*iqr
upper
75%
275.7481
boxplot(Hardness ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Hardness, ylab = 'Hardness')$out
out_val
out_rows <- which(water_potability$Hardness %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Hardness)
Min. 1st Qu. Median Mean 3rd Qu. Max.
121.0 177.7 197.3 196.2 215.5 272.1
#-------------------------------------------
summary(water_potability$Solids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
320.9 15704.5 20855.3 21840.2 27045.9 56488.7
quartiles <- quantile(water_potability$Solids, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
15704.48 27045.93
iqr <- IQR(water_potability$Solids)
iqr
[1] 11341.45
lower <- quartiles[1] - 1.5*iqr
lower
25%
-1307.69
upper <- quartiles[2] + 1.5*iqr
upper
75%
44058.1
boxplot(Solids ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Solids, ylab = 'Solids')$out
out_val
out_rows <- which(water_potability$Solids %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Solids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
320.9 15547.5 20518.7 21419.6 26734.7 43195.5
#-------------------------------------------
summary(water_potability$Chloramines)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.391 6.141 7.135 7.135 8.094 13.127
quartiles <- quantile(water_potability$Chloramines, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
6.141236 8.094323
iqr <- IQR(water_potability$Chloramines)
iqr
[1] 1.953087
lower <- quartiles[1] - 1.5*iqr
lower
25%
3.211605
upper <- quartiles[2] + 1.5*iqr
upper
75%
11.02395
boxplot(Chloramines ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Chloramines, ylab = 'Chloramines')$out
out_val
out_rows <- which(water_potability$Chloramines %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Chloramines)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.352 6.181 7.137 7.136 8.076 10.897
#-------------------------------------------
summary(water_potability$Sulfate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
187.2 308.2 332.6 333.4 358.3 481.0
quartiles <- quantile(water_potability$Sulfate, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
308.1884 358.3020
iqr <- IQR(water_potability$Sulfate)
iqr
[1] 50.11358
lower <- quartiles[1] - 1.5*iqr
lower
25%
233.0181
upper <- quartiles[2] + 1.5*iqr
upper
75%
433.4724
boxplot(Sulfate ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Sulfate, ylab = 'Sulfate')$out
out_val
out_rows <- which(water_potability$Sulfate %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Sulfate)
Min. 1st Qu. Median Mean 3rd Qu. Max.
237.5 309.2 332.8 333.6 357.7 429.8
#-------------------------------------------
summary(water_potability$Conductivity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
201.6 366.6 423.6 426.8 482.6 753.3
quartiles <- quantile(water_potability$Conductivity, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
366.5581 482.5983
iqr <- IQR(water_potability$Conductivity)
iqr
[1] 116.0401
lower <- quartiles[1] - 1.5*iqr
lower
25%
192.4979
upper <- quartiles[2] + 1.5*iqr
upper
75%
656.6585
boxplot(Conductivity ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Conductivity, ylab = 'Conductivity')$out
out_val
out_rows <- which(water_potability$Conductivity %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Conductivity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
201.6 366.4 423.1 426.0 481.9 652.5
#-------------------------------------------
summary(water_potability$Organic_carbon)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.372 12.184 14.351 14.417 16.788 27.007
quartiles <- quantile(water_potability$Organic_carbon, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
12.18447 16.78779
iqr <- IQR(water_potability$Organic_carbon)
iqr
[1] 4.603315
lower <- quartiles[1] - 1.5*iqr
lower
25%
5.279502
upper <- quartiles[2] + 1.5*iqr
upper
75%
23.69276
boxplot(Organic_carbon ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Organic_carbon, ylab = 'Organic_carbon')$out
out_val
out_rows <- which(water_potability$Organic_carbon %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Organic_carbon)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.512 12.222 14.352 14.426 16.786 23.604
#-------------------------------------------
summary(water_potability$Trihalomethanes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.577 55.865 66.231 66.364 77.418 124.000
quartiles <- quantile(water_potability$Trihalomethanes, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
55.86494 77.41789
iqr <- IQR(water_potability$Trihalomethanes)
iqr
[1] 21.55295
lower <- quartiles[1] - 1.5*iqr
lower
25%
23.53552
upper <- quartiles[2] + 1.5*iqr
upper
75%
109.7473
boxplot(Trihalomethanes ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Trihalomethanes, ylab = 'Trihalomethanes')$out
out_val
out_rows <- which(water_potability$Trihalomethanes %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Trihalomethanes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
24.53 55.96 66.29 66.42 77.34 108.85
#-------------------------------------------
summary(water_potability$Turbidity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.450 3.441 3.975 3.973 4.519 6.495
quartiles <- quantile(water_potability$Turbidity, probs = c(.25, .75), na.rm = FALSE)
quartiles
25% 75%
3.440859 4.518751
iqr <- IQR(water_potability$Turbidity)
iqr
[1] 1.077892
lower <- quartiles[1] - 1.5*iqr
lower
25%
1.824021
upper <- quartiles[2] + 1.5*iqr
upper
75%
6.135588
boxplot(Turbidity ~ Potability, data = water_potability)
repeat {
out_val <- boxplot(water_potability$Turbidity, ylab = 'Turbidity')$out
out_val
out_rows <- which(water_potability$Turbidity %in% c(out_val))
out_rows
if(sum(out_rows) > 0) water_potability <- water_potability[-out_rows,]
else {break}
}
summary(water_potability$Turbidity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.873 3.443 3.974 3.972 4.512 6.084
dim(water_potability)
[1] 1750 10
str(water_potability)
'data.frame': 1750 obs. of 10 variables:
$ ph : num 8.32 9.09 5.58 10.22 8.64 ...
$ Hardness : num 214 181 188 248 203 ...
$ Solids : num 22018 17979 28749 28750 13672 ...
$ Chloramines : num 8.06 6.55 7.54 7.51 4.56 ...
$ Sulfate : num 357 310 327 394 303 ...
$ Conductivity : num 363 398 280 284 475 ...
$ Organic_carbon : num 18.4 11.6 8.4 13.8 12.4 ...
$ Trihalomethanes: num 100.3 32 54.9 84.6 62.8 ...
$ Turbidity : num 4.63 4.08 2.56 2.67 4.4 ...
$ Potability : Factor w/ 2 levels "Not Potable",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'omit' Named int [1:1265] 1 2 3 9 12 14 15 17 19 21 ...
..- attr(*, "names")= chr [1:1265] "1" "2" "3" "9" ...
head(water_potability)
Charts are visual representations of data that help us understand and analyze information more easily. They can be used to display trends, comparisons, and relationships between different variables. There are various types of charts, such as histograms, bar plots, and scatter plots.
The histogram shows the frequency of pH values in the dataset; we noted that the majority of values fall within the usual range, roughly between 6 and 8, but it also reveals several outliers.
hist(water_potability$ph)
hist(water_potability$Chloramines)
hist(water_potability$Hardness)
hist(water_potability$Solids)
hist(water_potability$Sulfate)
hist(water_potability$Conductivity)
hist(water_potability$Organic_carbon)
hist(water_potability$Trihalomethanes)
hist(water_potability$Turbidity)
The bar plot represents how pH levels relate to water potability in the dataset; it indicates that water with a pH level above 10 is not potable and humans cannot consume it.
tab <- water_potability$Potability %>% table()
txt <- paste0(tab)
bb <- water_potability$ph %>% table() %>% barplot( main='ph',col=c('pink'))
bb <- water_potability$Potability %>% table() %>% barplot( main='Potability',ylab='Frequency',col=c('pink', 'lightblue'))
text(bb, tab/2, labels=txt, cex=1)
This scatter plot demonstrates the correlation and proportionality between the two attributes, allowing us to establish whether or not Trihalomethanes and pH are connected.
with(water_potability, plot(Trihalomethanes, ph, col = Potability, pch = as.numeric(Potability)))
This computes the correlation between the features and represents it in a heat map.
correlation_matrix <- cor(water_potability[,1:9])
high_correlation_features <- findCorrelation(correlation_matrix, cutoff = 0.5)
print(high_correlation_features)
integer(0)
heatmap(correlation_matrix)
Since findCorrelation returned no features above the 0.5 cutoff, there were no highly correlated features to remove; the heat map visualizes the full correlation matrix.
Ranking features by importance is a technique used to identify the most influential variables in a dataset for predicting a target variable. This process helps in understanding which features have the most impact on the model's performance.
#train random forest model and calculate feature importance
rf = randomForest(x= water_potability[,1:9],y= water_potability[,10])
var_imp <- varImp(rf, scale = FALSE)
#sort the score in decreasing order
var_imp_df <- data.frame(cbind(variable = rownames(var_imp), score = var_imp[,1]))
var_imp_df$score <- as.double(var_imp_df$score)
var_imp_df[order(var_imp_df$score,decreasing = TRUE),]
ggplot(var_imp_df, aes(x=reorder(variable, score), y=score)) +
geom_point() +
geom_segment(aes(x=variable,xend=variable,y=0,yend=score)) +
ylab("IncNodePurity") +
xlab("Variable Name") +
coord_flip()
Recursive feature elimination (RFE) is a feature selection technique that recursively removes less important features from a model until the optimal subset is identified. It involves repeatedly training the model, ranking features based on their importance, and eliminating the least important ones.
control <- rfeControl(functions=rfFuncs, method="cv",number=10)
rf <- trainControl(method = "cv", number = 10, verboseIter = FALSE)
# run the RFE algorithm
rfe_model <- rfe(x= water_potability[,1:9],y= water_potability[,10], sizes=c(1:9), rfeControl=control)
# summarize the results
print(rfe_model)
# list the chosen features
predictors(rfe_model)
# plot the results
plot(rfe_model, type=c("g", "o"))
Data transformation is the process of converting or modifying raw data into a different format or structure to make it more suitable for analysis or modeling.
Normalization refers to the process of scaling variables to have a common range. It helps in comparing variables with different scales.
wp<- water_potability
normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) }
wp$Solids=normalize(wp$Solids)
The Solids attribute would create critical challenges because of its vast and widely dispersed values: the min is 320.9 and the max is 43195.5, so we normalized Solids to the range 0 to 1 to make the values smaller and more comparable.
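As a quick sanity check, min-max scaling should map any numeric vector onto [0, 1]; the values below are illustrative, not taken from the dataset:

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Illustrative Solids-like values (hypothetical, for demonstration only)
solids_sample <- c(320.9, 15547.5, 20518.7, 26734.7, 43195.5)
scaled <- normalize(solids_sample)
print(round(scaled, 3))
```

The smallest value maps exactly to 0 and the largest to 1, so attributes with very different ranges become directly comparable.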
Discretization is the process of transforming continuous variables into discrete or categorical variables. It can be useful for analyzing data with many unique values or simplifying it.
wp$ph= cut(wp$ph, breaks = seq(3,11,by=4),right=FALSE)
wp$Hardness= cut(wp$Hardness, breaks = seq(120,280,by=40),right=FALSE)
wp$Chloramines = cut(wp$Chloramines, breaks = seq(3,11,by=4),right = FALSE)
wp$Sulfate= cut(wp$Sulfate, breaks = seq(220,440,by=44),right=FALSE)
wp$Conductivity= cut(wp$Conductivity, breaks = seq(200,700,by=100),right=FALSE)
wp$Organic_carbon= cut(wp$Organic_carbon, breaks = seq(4,24,by=4),right=FALSE)
wp$Trihalomethanes= cut(wp$Trihalomethanes, breaks = seq(20,110,by=10),right=FALSE)
wp$Turbidity= cut(wp$Turbidity, breaks = seq(1,7,by=2),right=FALSE)
print(wp)
Therefore, we transformed the continuous values of the numeric attributes into intervals through discretization, mapping each value to one of the possible interval labels. The values become more meaningful and simpler to classify or use with other methods later in our model. For example, in Trihalomethanes we created equal-width intervals of 10: [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) [80,90) [90,100) [100,110).
Encoding is the process of converting categorical values (characters or strings) into a numeric format. Since we don't have a nominal attribute in our dataset, we could not implement it.
wp
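Had the dataset contained a nominal attribute, one-hot (dummy) encoding could be sketched as follows; the 'Source' column here is invented purely for illustration and does not exist in water_potability:

```r
# Hypothetical example: one-hot encoding a nominal attribute with model.matrix().
# The 'Source' column is an invented stand-in, not part of the real dataset.
samples <- data.frame(
  ph = c(7.0, 6.5, 8.1),
  Source = factor(c("well", "river", "well"))
)

# "~ 0 + Source" drops the intercept so each factor level gets its own 0/1 column
encoded <- model.matrix(~ 0 + Source, data = samples)
print(encoded)
```

Each row then has exactly one 1 among the indicator columns, marking its level.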
For classification, we chose the decision tree algorithm, a recursive approach constructing a tree structure with leaf nodes signifying final decisions. The objective was to predict the class label (potability), with values 0 or 1, based on attributes like pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, and Turbidity. The dataset underwent division into training and testing sets for constructing and evaluating the decision tree. Model evaluation encompassed metrics like accuracy and cost-sensitive measures, gauged using a confusion matrix. Our toolkit included packages such as ‘party’ and ‘caret,’ incorporating methods like ‘sample’ for data splitting, ‘ctree’ for decision tree construction, ‘predict’ for testing predictions, and ‘confusionMatrix’ for model evaluation.
In the unsupervised clustering phase, we excluded the class label attribute "Potability" and utilized numeric attributes such as pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, and Turbidity. Employing the K-means algorithm, clusters were formed, each represented by a center point, and objects were assigned to the nearest cluster. For this phase, we made use of packages such as 'cluster' and 'factoextra', incorporating methods like 'scale()' for data scaling and 'kmeans()' for cluster creation.
Cluster validation was performed using the ‘silhouette()’ method to calculate averages for each cluster. In both supervised and unsupervised techniques, we maintained result consistency by employing the ‘set.seed()’ method with the same random number when experimenting with different dataset sizes.
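The clustering and validation steps described above can be sketched as follows; synthetic data stands in for the nine numeric water-quality attributes, and k = 2 is an illustrative choice rather than a tuned value:

```r
library(cluster)  # for silhouette()

set.seed(1958)
# Synthetic stand-in for the scaled numeric attributes: two loose groups
X <- rbind(
  matrix(rnorm(100 * 3, mean = 0), ncol = 3),
  matrix(rnorm(100 * 3, mean = 3), ncol = 3)
)

X_scaled <- scale(X)                        # standardize attributes before K-means
km <- kmeans(X_scaled, centers = 2, nstart = 25)

# Silhouette validation: average width per cluster measures cohesion vs. separation
sil <- silhouette(km$cluster, dist(X_scaled))
print(summary(sil)$clus.avg.widths)
```

Positive average silhouette widths close to 1 indicate well-separated clusters; values near 0 or negative suggest the chosen k does not fit the data.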
# Training technique
In the provided code, we systematically addressed outliers in multiple columns of our dataset. Beginning with a summary of each column’s statistics, including quartiles and the interquartile range (IQR), we established outlier detection limits. A visual assessment was conducted using boxplots, categorized by relevant variables such as “Potability.” A loop was implemented to iteratively identify and remove outliers in each column, ensuring a robust cleansing process. The final step involved summarizing the columns post-outlier removal, offering insights into the impact on the distribution of each variable. This comprehensive approach was applied uniformly to all dataset columns, promoting consistency in the outlier-handling process.
In this R code, the dataset undergoes a process of training and testing using the ID3 algorithm for decision tree classification. The data is successively split into training sets of 70%, 80%, and 90%, with corresponding testing sets of 30%, 20%, and 10%. The decision tree models are trained on these subsets, utilizing features such as pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity to predict water potability.
For each split, the decision tree models are evaluated on their respective testing sets, and performance metrics such as confusion matrices and accuracy are computed. Additionally, Receiver Operating Characteristic (ROC) curves are generated, providing insights into the models’ discrimination capabilities. The Area Under the Curve (AUC) is calculated as a quantitative measure of model performance. This comprehensive approach enables a systematic exploration of the ID3 decision tree’s effectiveness in predicting water potability under varying training and testing scenarios.
Splitting the data set into two subsets: Training(70%) and Testing(30%):
set.seed(1958)
ind <- sample(2, nrow(wp), replace = TRUE, prob = c(0.7, 0.3))
train.data <- wp[ind == 1, ]
test.data <- wp[ind == 2, ]
train.data$Potability <- as.factor(train.data$Potability)
test.data$Potability <- as.factor(test.data$Potability)
myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate+Conductivity+Organic_carbon+Trihalomethanes+Turbidity
#myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate
m.ctree <- ctree(myFormula, data = train.data)
table(predict(m.ctree), train.data$Potability)
print(m.ctree)
plot(m.ctree, type="simple")
testPred <- predict(m.ctree, newdata = test.data)
result<-table(testPred, test.data$Potability)
co_result <- confusionMatrix(result)
print(co_result)
as.matrix(co_result, what = "classes")
acc <- co_result$overall["Accuracy"]
acc*100
pred_probs <- as.numeric(predict(m.ctree, newdata = test.data, type = "response"))
binary_outcome <- as.numeric(test.data$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(80%) and Testing(20%):
set.seed(1958)
ind <- sample(2, nrow(wp), replace = TRUE, prob = c(0.8, 0.2))
train.data <- wp[ind == 1, ]
test.data <- wp[ind == 2, ]
train.data$Potability <- as.factor(train.data$Potability)
myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate+Conductivity+Organic_carbon+Trihalomethanes+Turbidity
#myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate
m.ctree <- ctree(myFormula, data = train.data)
table(predict(m.ctree), train.data$Potability)
print(m.ctree)
plot(m.ctree, type="simple")
testPred <- predict(m.ctree, newdata = test.data)
result<-table(testPred, test.data$Potability)
co_result <- confusionMatrix(result)
print(co_result)
as.matrix(co_result, what = "classes")
acc <- co_result$overall["Accuracy"]
acc*100
pred_probs <- as.numeric(predict(m.ctree, newdata = test.data, type = "response"))
binary_outcome <- as.numeric(test.data$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(90%) and Testing(10%):
set.seed(1958)
ind <- sample(2, nrow(wp), replace = TRUE, prob = c(0.9, 0.1))
train.data <- wp[ind == 1, ]
test.data <- wp[ind == 2, ]
train.data$Potability <- as.factor(train.data$Potability)
myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate+Conductivity+Organic_carbon+Trihalomethanes+Turbidity
#myFormula <- Potability ~ ph+Hardness+Solids+Chloramines+Sulfate
m.ctree <- ctree(myFormula, data = train.data)
table(predict(m.ctree), train.data$Potability)
print(m.ctree)
plot(m.ctree, type="simple")
testPred <- predict(m.ctree, newdata = test.data)
result<-table(testPred, test.data$Potability)
co_result <- confusionMatrix(result)
print(co_result)
as.matrix(co_result, what = "classes")
acc <- co_result$overall["Accuracy"]
acc*100
pred_probs <- as.numeric(predict(m.ctree, newdata = test.data, type = "response"))
binary_outcome <- as.numeric(test.data$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
J48, a decision tree-based classifier (the Weka implementation of C4.5), is utilized for model training and evaluation. The dataset undergoes cross-validation with different fold settings (3 folds, 5 folds, and 10 folds). For each fold configuration, a J48 model is trained, and its predictive performance is assessed using Receiver Operating Characteristic (ROC) curves. The Area Under the Curve is calculated for each ROC curve, providing a quantitative measure of the model's ability to predict water potability. Notably, visual inspection of the ROC curves indicates that the model trained with 10-fold cross-validation exhibits the highest discriminative performance. This comparative analysis across various cross-validation scenarios offers valuable insights into the robustness and generalization capability of the model.
# 3 folds
set.seed(1958)
train <- createFolds(wp$Potability, k=3)
C45Fit <- train(Potability ~ .,method = "J48",data = wp,
trControl = trainControl(
method = "cv",
index = train,
savePredictions = TRUE))
C45Fit
C45Fit$finalModel
pred_probs <- predict(C45Fit, newdata = wp, type = "prob")[, "Potable"]
binary_outcome <- as.numeric(wp$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
# 5 folds
set.seed(1958)
train <- createFolds(wp$Potability, k=5)
C45Fit <- train(Potability ~., method="J48", data=wp,
trControl = trainControl(
method ="cv",
index = train,
savePredictions = TRUE))
C45Fit
C45Fit$finalModel
pred_probs <- predict(C45Fit, newdata = wp, type = "prob")[, "Potable"]
binary_outcome <- as.numeric(wp$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
# 10 folds
set.seed(1958)
train <- createFolds(wp$Potability, k=10)
C45Fit <- train(Potability ~., method="J48", data=wp,
trControl = trainControl(
method="cv", indexOut=train))
C45Fit
C45Fit$finalModel
pred_probs <- predict(C45Fit, newdata = wp, type = "prob")[, "Potable"]
binary_outcome <- as.numeric(wp$Potability == "Potable")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
C5.0 is a newer version of C4.5. Splitting the data set into two subsets: Training(70%) and Testing(30%):
set.seed(1958)
train.indices <- sample(2, nrow(water_potability), replace=TRUE, prob=c(0.7, 0.3))
w.train <- water_potability[train.indices == 1, ]
w.test <- water_potability[train.indices == 2, ]
w.train$Potability <- as.factor(w.train$Potability)
w.test$Potability <- as.factor(w.test$Potability) # factor levels are "0" and "1"
model <- C5.0(Potability ~., data=w.train)
results <- predict(object=model, newdata=w.test, type="class")
table(results, w.test$Potability)
plot(model)
r <- confusionMatrix(results, w.test$Potability)
acc <- r$overall["Accuracy"]*100
acc
as.matrix(r, what = "classes")
print(r)
pred_probs <- predict(model, newdata = w.test, type = "prob")[, "1"] # probability of class "1" (potable)
binary_outcome <- as.numeric(w.test$Potability == "1")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(80%) and Testing(20%):
set.seed(1958)
train.indices <- sample(2, nrow(water_potability), replace=TRUE, prob=c(0.8, 0.2))
w.train <- water_potability[train.indices == 1, ]
w.test <- water_potability[train.indices == 2, ]
w.train$Potability <- as.factor(w.train$Potability)
w.test$Potability <- as.factor(w.test$Potability) # factor levels are "0" and "1"
model <- C5.0(Potability ~., data=w.train)
results <- predict(object=model, newdata=w.test, type="class")
table(results, w.test$Potability)
plot(model)
r <- confusionMatrix(results, w.test$Potability)
acc <- r$overall["Accuracy"]*100
acc
as.matrix(r, what = "classes")
print(r)
pred_probs <- predict(model, newdata = w.test, type = "prob")[, "1"] # probability of class "1" (potable)
binary_outcome <- as.numeric(w.test$Potability == "1")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(90%) and Testing(10%):
set.seed(1958)
train.indices <- sample(2, nrow(water_potability), replace=TRUE, prob=c(0.9, 0.1))
w.train <- water_potability[train.indices == 1, ]
w.test <- water_potability[train.indices == 2, ]
w.train$Potability <- as.factor(w.train$Potability)
w.test$Potability <- as.factor(w.test$Potability) # factor levels are "0" and "1"
model <- C5.0(Potability ~., data=w.train)
results <- predict(object=model, newdata=w.test, type="class")
table(results, w.test$Potability)
plot(model)
### To improve the readability of the decision tree, we sampled the data using only the pH and Sulfate attributes. We then split the data into Training (90%) and Testing (10%) sets using the same split procedure, which produced a more manageable decision tree:
set.seed(1958)
important_feature_sample <- select(water_potability, c(1, 5, 10)) # ph, Sulfate, Potability
train.indices <- sample(2, nrow(important_feature_sample), replace=TRUE, prob=c(0.9, 0.1))
w.train <- important_feature_sample[train.indices == 1, ]
w.test <- important_feature_sample[train.indices == 2, ]
w.train$Potability <- as.factor(w.train$Potability)
w.test$Potability <- as.factor(w.test$Potability) # factor levels are "0" and "1"
model <- C5.0(Potability ~., data=w.train)
results <- predict(object=model, newdata=w.test, type="class")
table(results, w.test$Potability)
plot(model)
r <- confusionMatrix(results, w.test$Potability)
acc <- r$overall["Accuracy"]*100
acc
as.matrix(r, what = "classes")
print(r)
pred_probs <- predict(model, newdata = w.test, type = "prob")[, "1"] # probability of class "1" (potable)
binary_outcome <- as.numeric(w.test$Potability == "1")
# ROC curve
roc_curve <- roc(binary_outcome, pred_probs)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
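The three C5.0 experiments above differ only in the split ratio, so they can be summarized in one loop. This is a sketch, assuming water_potability is already loaded with its numeric 0/1 Potability column; exact accuracies depend on the seed and split:

```r
# Sketch: compare C5.0 accuracy across the three train/test split ratios.
library(C50)
library(caret)

for (p in c(0.7, 0.8, 0.9)) {
  set.seed(1958)
  idx <- sample(2, nrow(water_potability), replace = TRUE, prob = c(p, 1 - p))
  tr <- water_potability[idx == 1, ]
  te <- water_potability[idx == 2, ]
  tr$Potability <- as.factor(tr$Potability)
  te$Potability <- as.factor(te$Potability)
  m <- C5.0(Potability ~ ., data = tr)
  pred <- predict(m, newdata = te, type = "class")
  acc <- confusionMatrix(pred, te$Potability)$overall["Accuracy"]
  cat(sprintf("Train %.0f%% / Test %.0f%%: accuracy = %.4f\n",
              100 * p, 100 * (1 - p), acc))
}
```

Looping over the split ratios keeps the comparison reproducible and avoids copy-paste drift between the three blocks.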
We employed the C5.0 algorithm, an enhanced version of the C4.5 decision tree, for model training and evaluation across different training and testing set splits: Training (70%) / Testing (30%), Training (80%) / Testing (20%), and Training (90%) / Testing (10%). For each case, the C5.0 model was trained on the designated training data, evaluated on the testing data, and assessed through accuracy, a confusion matrix, and the ROC curve with its Area Under the Curve (AUC).
Upon comparative analysis, the model trained with the largest proportion of data (Training 90%, Testing 10%) demonstrated superior performance, achieving higher accuracy and a more discriminative ROC curve. This exploration across different training and testing splits provides valuable insight into the robustness and generalization capability of the C5.0 decision tree algorithm for predicting water potability.
###Gini index (CART)
Splitting the data set into two subsets: Training(70%) and Testing(30%):
set.seed(1958)
train = sample(2, nrow(wp), replace=TRUE, prob=c(0.7, 0.3))
wp.train=wp[train == 1,]
wp.test=wp[train == 2,]
fit.tree = rpart(Potability ~ ., data=wp.train, method = "class", cp=0.008) # train on wp.train only, never on the full data
fit.tree
rpart.plot(fit.tree)
fit.tree$variable.importance
pred.tree = predict(fit.tree, wp.test, type = "class")
re <- table(pred.tree, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
plotcp(fit.tree)
printcp(fit.tree)
# Explicitly request the lowest cp value
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
bestcp <-fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
pruned.tree <- prune(fit.tree, cp = bestcp)
rpart.plot(pruned.tree)
pred.prune = predict(pruned.tree, wp.test, type="class")
re <- table(pred.prune, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
# rpart with method = "class" already returns class probabilities, so no transformation is needed
pred.tree_probs <- predict(fit.tree, wp.test, type = "prob")
# Extract probabilities for the "Potable" class
roc_curve <- roc(ifelse(wp.test$Potability == "Potable", 1, 0), pred.tree_probs[, "Potable"])
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(80%) and Testing(20%):
set.seed(1958)
train = sample(2, nrow(wp), replace=TRUE, prob=c(0.8, 0.2))
wp.train=wp[train == 1,]
wp.test=wp[train == 2,]
fit.tree = rpart(Potability ~ ., data=wp.train, method = "class", cp=0.008)
fit.tree
rpart.plot(fit.tree)
fit.tree$variable.importance
pred.tree = predict(fit.tree, wp.test, type = "class")
re <- table(pred.tree, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
plotcp(fit.tree)
printcp(fit.tree)
# Explicitly request the lowest cp value
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
bestcp <-fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
pruned.tree <- prune(fit.tree, cp = bestcp)
rpart.plot(pruned.tree)
pred.prune = predict(pruned.tree, wp.test, type="class")
re <- table(pred.prune, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
# rpart with method = "class" already returns class probabilities, so no transformation is needed
pred.tree_probs <- predict(fit.tree, wp.test, type = "prob")
roc_curve <- roc(ifelse(wp.test$Potability == "Potable", 1, 0), pred.tree_probs[, "Potable"])
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
Splitting the data set into two subsets: Training(90%) and Testing(10%):
set.seed(1958)
train = sample(2, nrow(wp), replace=TRUE, prob=c(0.9, 0.1))
wp.train=wp[train == 1,]
wp.test=wp[train == 2,]
fit.tree = rpart(Potability ~ ., data=wp.train, method = "class", cp=0.008)
fit.tree
rpart.plot(fit.tree)
fit.tree$variable.importance
pred.tree = predict(fit.tree, wp.test, type = "class")
re <- table(pred.tree, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
plotcp(fit.tree)
printcp(fit.tree)
# Explicitly request the lowest cp value
fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
bestcp <-fit.tree$cptable[which.min(fit.tree$cptable[,"xerror"]),"CP"]
pruned.tree <- prune(fit.tree, cp = bestcp)
rpart.plot(pruned.tree)
pred.prune = predict(pruned.tree, wp.test, type="class")
re <- table(pred.prune, wp.test$Potability)
co_re <- confusionMatrix(re)
print(co_re)
as.matrix(co_re, what = "classes")
acc <- co_re$overall["Accuracy"]
acc*100
# rpart with method = "class" already returns class probabilities, so no transformation is needed
pred.tree_probs <- predict(fit.tree, wp.test, type = "prob")
roc_curve <- roc(ifelse(wp.test$Potability == "Potable", 1, 0), pred.tree_probs[, "Potable"])
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, col = "gray", lty = 2)
# Print AUC
cat("AUC:", auc(roc_curve), "\n")
| | Information Gain | Gain Ratio | Gini index |
|---|---|---|---|
| Accuracy | 0.5820611 | 0.9686781 | 0.5935115 |
| Precision | 0.58546169 | 0.005417132 | 0.6004320 |
| Sensitivity | 0.97385621 | 0.9334741 | 0.9084967 |
| Specificity | 0.03211009 | 0.880677 | 0.1513761 |
As shown in the comparison, the best technique to choose is Gain Ratio, due to its high accuracy.
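A table like the one above can also be assembled directly from the caret confusionMatrix objects instead of copying numbers by hand. This is a sketch, assuming cm_ig, cm_gr, and cm_gini are hypothetical confusionMatrix results saved from the Information Gain, Gain Ratio, and Gini index models:

```r
# Sketch: collect the same four metrics from three hypothetical
# confusionMatrix objects (cm_ig, cm_gr, cm_gini) into one data frame.
collect_metrics <- function(cm) {
  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Precision   = unname(cm$byClass["Precision"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]))
}

comparison <- data.frame(
  `Information Gain` = collect_metrics(cm_ig),
  `Gain Ratio`       = collect_metrics(cm_gr),
  `Gini index`       = collect_metrics(cm_gini),
  check.names = FALSE)
print(round(comparison, 4))
```

For two-class problems, caret stores Precision, Sensitivity, and Specificity in the byClass slot and Accuracy in the overall slot.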
summary(water_potability)
str(water_potability)
We will use four different values of k for clustering and then compare their performance.
First, confirm that all the columns to be scaled are indeed numeric; sapply() can be used to check them and coerce them to numeric if necessary.
Since all columns are numeric, we scale all of them except the class label and save the result in a dataset called data_for_cluster, which we use for clustering.
water_potability<- sapply(water_potability, as.numeric)
data_for_cluster <- scale(water_potability[, !colnames(water_potability) %in% "Potability"])
#we use !colnames(water_potability) %in% "Potability" to exclude the "Potability" column
View(data_for_cluster)
# 3- run k-means clustering to find 2 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(data_for_cluster,2)
# print the clustering result
kmeans.result
# visualize clustering (2 clusters)
fviz_cluster(kmeans.result, data = data_for_cluster)
We took a sample of 50 records to make the plot more understandable.
# draw a sample of 50 records from the data, so that the clustering plot will not be overcrowded and is easy to understand
idx <- sample(1:dim(data_for_cluster)[1], 50)
sample_c1 <- data_for_cluster[idx, ]
## hierarchical clustering
hc.cut<- hcut(sample_c1, k = 2, hc_method= "complete")
A dendrogram is a tree diagram that displays the arrangement of data points in a hierarchical order based on their similarity or dissimilarity.
# Visualize dendrogram
fviz_dend(hc.cut,rect= TRUE)
# Visualize cluster
fviz_cluster(hc.cut, ellipse.type= "convex")
This method calculates the average silhouette width for different values of k, determining how well data points fit into their assigned clusters.
# average silhouette width for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(data_for_cluster)) # needs a dissimilarity object of class dist, or one coercible to it
fviz_silhouette(avg_sil)
BCubed precision and recall are metrics used to evaluate the performance of clustering algorithms, particularly in the context of evaluating the quality of clustering assignments for individual data points.
cluster_assignments <- kmeans.result$cluster
ground_truth_labels <- water_potability[, "Potability"] # use only the class label, not the whole dataset
data <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(data) {
n <- nrow(data)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster /total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
# Calculate average precision and recall
precision <- precision_sum / n
recall <- recall_sum / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")
Run k-means clustering to find 3 clusters; set a seed for random number generation to make the results reproducible.
set.seed(8953)
kmeans.result <- kmeans(data_for_cluster,3)
# print the clustering result
kmeans.result
# visualize clustering (3 clusters)
fviz_cluster(kmeans.result, data = data_for_cluster)
# draw a sample of 50 records from the data, so that the clustering plot will not be overcrowded and is easy to understand
idx2 <- sample(1:dim(data_for_cluster)[1], 50)
sample_c2 <- data_for_cluster[idx2, ]
## hierarchical clustering
hc2.cut<- hcut(sample_c2, k = 3, hc_method= "complete")
A dendrogram is a tree diagram that displays the arrangement of data points in a hierarchical order based on their similarity or dissimilarity.
fviz_dend(hc2.cut,rect= TRUE)
# Visualize cluster
fviz_cluster(hc2.cut, ellipse.type= "convex")
This method calculates the average silhouette width for different values of k, determining how well data points fit into their assigned clusters.
# average silhouette width for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(data_for_cluster)) # needs a dissimilarity object of class dist, or one coercible to it
fviz_silhouette(avg_sil)
BCubed precision and recall are metrics used to evaluate the performance of clustering algorithms, particularly in the context of evaluating the quality of clustering assignments for individual data points.
cluster_assignments <- kmeans.result$cluster
ground_truth_labels <- water_potability[, "Potability"] # use only the class label, not the whole dataset
data <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(data) {
n <- nrow(data)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster /total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
# Calculate average precision and recall
precision <- precision_sum / n
recall <- recall_sum / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")
# 3- run k-means clustering to find 4 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(data_for_cluster,4)
# print the clustering result
kmeans.result
# visualize clustering (4 clusters)
fviz_cluster(kmeans.result, data = data_for_cluster)
# draw a sample of 50 records from the data, so that the clustering plot will not be overcrowded and is easy to understand
idx3 <- sample(1:dim(data_for_cluster)[1], 50)
sample_c3 <- data_for_cluster[idx3, ]
## hierarchical clustering
hc3.cut<- hcut(sample_c3, k = 4, hc_method= "complete")
A dendrogram is a tree diagram that displays the arrangement of data points in a hierarchical order based on their similarity or dissimilarity.
# Visualize dendrogram
fviz_dend(hc3.cut,rect= TRUE)
# Visualize cluster
fviz_cluster(hc3.cut, ellipse.type= "convex")
This method calculates the average silhouette width for different values of k, determining how well data points fit into their assigned clusters.
# average silhouette width for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(data_for_cluster)) # needs a dissimilarity object of class dist, or one coercible to it
fviz_silhouette(avg_sil)
BCubed precision and recall are metrics used to evaluate the performance of clustering algorithms, particularly in the context of evaluating the quality of clustering assignments for individual data points.
cluster_assignments <- kmeans.result$cluster
ground_truth_labels <- water_potability[, "Potability"] # use only the class label, not the whole dataset
data <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(data) {
n <- nrow(data)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster /total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
# Calculate average precision and recall
precision <- precision_sum / n
recall <- recall_sum / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(data_for_cluster,5)
# print the clustering result
kmeans.result
# visualize clustering (5 clusters)
fviz_cluster(kmeans.result, data = data_for_cluster)
# draw a sample of 50 records from the data, so that the clustering plot will not be overcrowded and is easy to understand
idx4 <- sample(1:dim(data_for_cluster)[1], 50)
sample_c4 <- data_for_cluster[idx4, ]
## hierarchical clustering
hc4.cut<- hcut(sample_c4, k = 5, hc_method= "complete")
A dendrogram is a tree diagram that displays the arrangement of data points in a hierarchical order based on their similarity or dissimilarity.
# Visualize dendrogram
fviz_dend(hc4.cut,rect= TRUE)
# Visualize cluster
fviz_cluster(hc4.cut, ellipse.type= "convex")
This method calculates the average silhouette width for different values of k, determining how well data points fit into their assigned clusters.
# average silhouette width for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(data_for_cluster)) # needs a dissimilarity object of class dist, or one coercible to it
fviz_silhouette(avg_sil)
BCubed precision and recall are metrics used to evaluate the performance of clustering algorithms, particularly in the context of evaluating the quality of clustering assignments for individual data points.
cluster_assignments <- kmeans.result$cluster
ground_truth_labels <- water_potability[, "Potability"] # use only the class label, not the whole dataset
data <- data.frame(cluster = cluster_assignments, label = ground_truth_labels)
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(data) {
n <- nrow(data)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- data$cluster[i]
label <- data$label[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(data$label[data$cluster == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(data$cluster == cluster)
# Count the total number of items with the same category
total_same_category <- sum(data$label == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster /total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
# Calculate average precision and recall
precision <- precision_sum / n
recall <- recall_sum / n
return(list(precision = precision, recall = recall))
}
# Calculate BCubed precision and recall
metrics <- calculate_bcubed_metrics(data)
# Extract precision and recall from the metrics
precision <- metrics$precision
recall <- metrics$recall
# Print the results
cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")
The elbow method helps find the optimal number of clusters (k) in k-means clustering. The "elbow" point is where the rate of decrease in WSS slows, indicating a good balance between cluster count and cluster compactness. The goal is to select the smallest k that retains most of the data's variability.
# 3- Elbow method
#fviz_nbclust() with within cluster sums of squares (wss) method
fviz_nbclust(data_for_cluster, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2)+
labs(subtitle = "Elbow method")
WSS gives an indication of how well the data can be represented by a certain number of clusters. In k-means clustering, the total WSS always decreases as k grows, so we look for the k beyond which further decreases become small.
# note: computed on the raw (unscaled) data, so the magnitudes differ from the scaled elbow plot above
for (k in 2:5) {
  kmeans_result <- kmeans(water_potability, centers = k)
total_withinss <- kmeans_result$tot.withinss
cat("Total Within-Cluster Sum of Squares for k =", k, ":", total_withinss, "\n")
}
| | k = 2 | k = 3 | k = 4 | k = 5 |
|---|---|---|---|---|
| Average silhouette width | 0.08 | 0.08 | 0.07 | 0.08 |
| BCubed Precision | 0.0053507 | 0.005417132 | 0.005466656 | 0.005525278 |
| BCubed Recall | 0.9501209 | 0.9334741 | 0.9250736 | 0.9201568 |
Total Within-Cluster Sum of Squares for k = 2 : 35435736178
Total Within-Cluster Sum of Squares for k = 3 : 18005636207
Total Within-Cluster Sum of Squares for k = 4 : 10644686913
Total Within-Cluster Sum of Squares for k = 5 : 7263150876
After testing all four values of k, we concluded that k = 5 is the best choice for our dataset.
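The per-k comparison above can also be reproduced in a single loop over the average silhouette widths. This is a sketch using the same scaled data (data_for_cluster) and seed as the sections above:

```r
# Sketch: average silhouette width for k = 2..5 on the scaled data.
library(cluster)

d <- dist(data_for_cluster)           # compute the dissimilarity matrix once
for (k in 2:5) {
  set.seed(8953)
  km <- kmeans(data_for_cluster, centers = k)
  sil <- silhouette(km$cluster, d)    # per-point silhouette values
  cat("k =", k, ": average silhouette width =",
      round(mean(sil[, "sil_width"]), 3), "\n")
}
```

Computing dist() once outside the loop avoids repeating the most expensive step for each k.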
A classification approach is more fitting for the task of determining water potability. The inherent design of classification models to make binary decisions aligns seamlessly with the goal of identifying whether water is safe for consumption. This focused predictive capability provides actionable insights crucial for effective water treatment decisions. The classification model directly communicates the safety status of a water sample, aiding in adherence to quality standards. While clustering serves well in exploratory analysis, it might not be as directly aligned with the precise objective of predicting water potability. Hence, for this specific task, a binary classification model is recommended.